3) Reliability
Zahra Fayaz

Introduction

A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of communicative language ability and to minimize the effect of these factors on that measure. We must be concerned about errors of measurement, or unreliability, because we know that test performance is affected by factors other than the abilities we want to measure. For example, we can all think of factors such as poor health, fatigue, lack of interest or motivation, and test-wiseness that can affect individuals’ test performance but that are not generally associated with language ability, and thus are not characteristics we want to measure with language tests. Such factors distort the measurement of the abilities we do want to measure and hence reduce the reliability of language test scores. When we increase the reliability of our measures, we are also satisfying a necessary condition for validity: in order for a test score to be valid, it must be reliable.

Many discussions of reliability and validity emphasize the differences between these two qualities, rather than their similarities. But instead of considering these as two entirely distinct concepts, I believe both can be better understood by recognizing them as complementary aspects of a common concern in measurement - identifying, estimating, and controlling the effects of factors that affect test scores. The investigation of reliability is concerned with answering the question, ‘How much of an individual’s test performance is due to measurement error, or to factors other than the language ability we want to measure?’

Reliability is one of the most important elements of test quality. It has to do with the consistency, or reproducibility, of an examinee's performance on the test. For example, if you were to administer a test with high reliability to an examinee on two occasions, you would be very likely to reach the same conclusions about the examinee's performance both times.
A test with poor reliability, on the other hand, might result in very different scores for the examinee across the two test administrations. If a test yields inconsistent scores, it may be unethical to take any substantive actions on the basis of the test. There are several methods for computing test reliability, including test-retest reliability, parallel forms reliability, decision consistency, internal consistency, and interrater reliability. For many criterion-referenced tests, decision consistency is often an appropriate choice.

What Is Reliability?

Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly. For example, if a test is designed to measure a trait (such as introversion), then each time the test is administered to a subject, the results should be approximately the same. Unfortunately, it is impossible to calculate reliability exactly, but it can be estimated in a number of different ways.

The generic name for consistency is reliability. Reliability is an essential characteristic of a good test, because if a test doesn’t measure consistently (reliably), then one could not count on the scores resulting from a particular administration to be an accurate index of students’ achievement. One wouldn’t trust bathroom scales if the reading fluctuated according to the temperature or humidity or if the scales had a loose spring. Similarly, we can’t trust the scores from our tests unless we know about the consistency with which they measure. Only to the extent that test scores are reliable can they be useful and fair to students.

Technically, reliability shows the extent to which test scores are free from errors of measurement. No classroom test is perfectly reliable because random errors operate to cause scores to vary or be inconsistent from time to time and situation to situation. The goal is to try to minimize these inevitable errors of measurement and thus increase reliability.
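
The idea that reliability is the extent to which scores are free from measurement error is often formalized in classical true-score terms. The following is only a sketch under that model; the symbols are assumptions introduced here and do not appear in the text above: X is an observed score, T the true score, E the error component (assumed uncorrelated with T), and the sigma-squared terms are their variances.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Under these assumptions the reliability coefficient approaches 1 as error variance shrinks. Because T cannot be observed directly, the coefficient can only be estimated, which is consistent with the point that reliability cannot be calculated exactly.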
Reliability is used to describe the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. For example, measurements of people’s height and weight are often extremely reliable.

Difference from validity

Reliability does not imply validity. That is, a reliable measure that measures something consistently is not necessarily measuring what you want to measure. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is analogous to precision, while validity is analogous to accuracy.

While reliability does not imply validity, a lack of reliability does place a limit on the overall validity of a test. A test that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person or as a means of predicting scores on a criterion. While a reliable test may provide useful valid information, a test that is not reliable cannot possibly be valid.

An example often used to illustrate the difference between reliability and validity in the experimental sciences involves a common bathroom scale. If someone who weighs 200 pounds steps on a scale 10 times and gets readings of 15, 250, 95, 140, etc., the scale is not reliable. If the scale consistently reads "150", then it is reliable, but not valid. If it reads "200" each time, then the measurement is both reliable and valid.

Factors that affect language test scores

Measurement specialists have long recognized that the examination of reliability depends upon our ability to distinguish the effects (on test scores) of the abilities we want to measure from the effects of other factors.
That is, if we wish to estimate how reliable our test scores are, we must begin with a set of definitions of the abilities we want to measure, and of the other factors that we expect to affect test scores (Stanley 1971: 362). Thus, both Thorndike (1951) and Stanley (1971) begin their extensive treatments of reliability with general frameworks for describing the factors that cause test scores to vary from individual to individual. These frameworks include general and specific lasting characteristics, general and specific temporary characteristics, and systematic and chance factors related to test administration and scoring. General frameworks such as these provide a basis for the more precise definition of factors that affect performance on tests of specific abilities.

Types of Reliability

1) Test-Retest Reliability

To estimate test-retest reliability, you must administer a test form to a single group of examinees on two separate occasions. Typically, the two separate administrations are only a few days or a few weeks apart; the time should be short enough so that the examinees' skills in the area being assessed have not changed through additional learning. The relationship between the examinees' scores from the two different administrations is estimated, through statistical correlation, to determine how similar the scores are. This type of reliability demonstrates the extent to which a test is able to produce stable, consistent scores across time.

2) Parallel Forms Reliability

Many exam programs develop multiple, parallel forms of an exam to help provide test security. These parallel forms are all constructed to match the test blueprint, and the parallel test forms are constructed to be similar in average item difficulty. Parallel forms reliability is estimated by administering both forms of the exam to the same group of examinees.
While the time between the two test administrations should be short, it does need to be long enough so that examinees' scores are not affected by fatigue. The examinees' scores on the two test forms are correlated in order to determine how similarly the two test forms function. This reliability estimate is a measure of how consistent examinees’ scores can be expected to be across test forms.

3) Decision Consistency

In the descriptions of test-retest and parallel forms reliability given above, the consistency or dependability of the test scores was emphasized. For many criterion-referenced tests (CRTs), a more useful way to think about reliability may be in terms of examinees’ classifications. For example, a typical CRT will result in an examinee being classified as either a master or non-master; the examinee will either pass or fail the test. It is the reliability of this classification decision that is estimated in decision consistency reliability. If an examinee is classified as a master on both test administrations, or as a non-master on both occasions, the test is producing consistent decisions. This approach can be used either with parallel forms or with a single form administered twice in test-retest fashion.

4) Internal Consistency

The internal consistency measure of reliability is frequently used for norm-referenced tests (NRTs). This method has the advantage that it can be carried out with a single form given at a single administration. The internal consistency method estimates how well the items on a test correlate with one another; that is, how similar the items on a test form are to one another. Many test analysis software programs produce this reliability estimate automatically. However, two common differences between NRTs and CRTs make this method of reliability estimation less useful for CRTs.
First, because CRTs are typically designed to have a much narrower range of item difficulty and of examinee scores, the value of the reliability estimate will tend to be lower. Additionally, CRTs are often designed to measure a broader range of content; this results in a set of items that are not necessarily closely related to each other. This aspect of CRT test design will also produce a lower reliability estimate than would be seen on a typical NRT.

5) Interrater Reliability

All of the methods for estimating reliability discussed thus far are intended to be used for objective tests. When a test includes performance tasks, or other items that need to be scored by human raters, then the reliability of those raters must be estimated. This reliability method asks the question, "If multiple raters scored a single examinee's performance, would the examinee receive the same score?" Interrater reliability provides a measure of the dependability or consistency of scores that might be expected across raters.

Summary

Test reliability is the aspect of test quality concerned with whether or not a test produces consistent results. While there are several methods for estimating test reliability, for objective CRTs the most useful types are probably test-retest reliability, parallel forms reliability, and decision consistency. A type of reliability that is more useful for NRTs is internal consistency. For performance-based tests, and other tests that use human raters, interrater reliability is likely to be the most appropriate method.

Fundamental to the development and use of language tests is being able to identify and estimate the effect of various factors on language test scores. In order to interpret test scores as indicators of a given language ability, we must be sure that they are influenced as much as possible by that ability.
Any factors other than the ability being tested that affect test scores are potential sources of error that decrease both the reliability of scores and the validity of their interpretations. Therefore, it is essential that we be able to identify these sources of error and estimate the magnitude of their effect on test scores. Our ability to do this depends upon how we define the various influences on test scores.
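
The estimation methods described under Types of Reliability can be illustrated with a small Python sketch. It is not taken from the source and is only one plausible way to compute these estimates: the function names (`pearson_r`, `decision_consistency`, `cronbach_alpha`), the cut-score convention (a score at or above the cut counts as a pass), and the choice of Cronbach's alpha as the internal-consistency coefficient are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Applies to test-retest data (same form, two occasions) and to
    parallel-forms data (two forms, same examinees).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    # Raises ZeroDivisionError if either list has no variation.
    return cov / (ssx * ssy) ** 0.5


def decision_consistency(scores_1, scores_2, cut_score):
    """Proportion of examinees classified the same way (pass or fail)
    on both administrations. Assumption: score >= cut_score is a pass."""
    if len(scores_1) != len(scores_2) or not scores_1:
        raise ValueError("need two equal-length, non-empty score lists")
    same = sum((a >= cut_score) == (b >= cut_score)
               for a, b in zip(scores_1, scores_2))
    return same / len(scores_1)


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a single administration.

    item_scores: one row per examinee, one column per item.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least 2 items")
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores for five examinees on two occasions (illustration only).
first = [10, 12, 15, 18, 20]
second = [11, 12, 14, 19, 20]
print("test-retest r:", round(pearson_r(first, second), 3))
print("decision consistency:", decision_consistency(first, second, 15))
```

With these made-up scores and a cut score of 15, four of the five examinees receive the same pass/fail decision on both occasions, so the decision consistency is 0.8. The same `pearson_r` function could be applied to two raters' scores for the same performances as a rough interrater estimate, although that use is also an assumption here, not a method stated in the text.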